Regression Analysis: Gender Disparities and Political Influence

Authors (Team 8)

Jianhao Hong, Boston University
Xinran Li, Boston University
Chialing Sung, Boston University
Zimo Zeng, Boston University

1 🎯 Objectives

The objective of this regression analysis is to explore how gender composition, occupation type, and state relate to salary levels in the U.S. labor market. Using multiple linear regression and a random forest, we aim to identify the features that contribute most to wage disparities and to assess how much predictive power the female ratio carries.

2 ✍️ Model Inputs and Methodology

We built two regression models—Multiple Linear Regression and Random Forest—to predict salary using three inputs:

  • Female Ratio: the share of female workers in each occupation (women / total, from the 2023 gender sheet), attached to every job posting by occupation
  • State: one‑hot encoded dummy variables for each state
  • Occupation: one‑hot encoded dummy variables for each broad occupational group
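
To make the encoding step concrete, here is a minimal sketch of how the two categorical inputs turn into dummy columns. The three rows are invented placeholders, not records from the job-postings data; only the column names follow the ones used in this analysis.

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative only.
toy = pd.DataFrame({
    "female_ratio": [0.30, 0.72, 0.30],
    "STATE_NAME": ["California", "Texas", "California"],
    "Occupation": [
        "Management occupations",
        "Sales and office occupations",
        "Management occupations",
    ],
})

# drop_first=True removes one level per categorical column, so the
# remaining dummies are measured against that dropped reference level.
encoded = pd.get_dummies(
    toy, columns=["STATE_NAME", "Occupation"], drop_first=True
)
print(encoded.columns.tolist())
```

With `drop_first=True`, the alphabetically first level of each column (here California and Management occupations) becomes the reference category, which matters when reading linear regression coefficients.
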
Code
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

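# Input files: gender composition by occupation, and the job postings
xlsx_path = "~/ad688-employability-sp25A1-group8-1/data/gender.xlsx"
job_posting_path = "~/ad688-employability-sp25A1-group8-1/job_postings.csv"

# 1. Load the 2023 gender sheet and compute the female share per occupation
df_gender = pd.read_excel(xlsx_path, sheet_name="2023", engine="openpyxl")
df_gender["female_ratio"] = df_gender["women"] / df_gender["total"]

# Load the job postings; non-numeric NAICS2 codes become NaN
df_jobs = pd.read_csv(job_posting_path)
df_jobs["NAICS2"] = pd.to_numeric(df_jobs["NAICS2"], errors="coerce")

# 2. Map two-digit NAICS industry codes to broad occupation groups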
naics_to_occupation = {
    11: "Farming, fishing, and forestry occupations",
    21: "Natural resources, construction, and maintenance occupations",
    22: "Production, transportation, and material moving occupations",
    23: "Construction and extraction occupations",
    31: "Production, transportation, and material moving occupations",
    42: "Sales and office occupations",
    44: "Sales and office occupations",
    48: "Production, transportation, and material moving occupations",
    51: "Computer and mathematical occupations",
    52: "Business and financial operations occupations",
    53: "Sales and office occupations",
    54: "Professional and related occupations",
    55: "Management occupations",
    56: "Office and administrative support occupations",
    61: "Education, training, and library occupations",
    62: "Healthcare practitioners and technical occupations",
    71: "Arts, design, entertainment, sports, and media occupations",
    72: "Food preparation and serving related occupations",
    81: "Personal care and service occupations",
    92: "Public Administration",
    99: "Unclassified"
}
df_jobs["Occupation"] = df_jobs["NAICS2"].map(naics_to_occupation)

df_merged = df_jobs.merge(
    df_gender[["occupation","female_ratio"]],
    left_on="Occupation", right_on="occupation",
    how="left"
)

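# Label each row by gender composition. Rows with no matched ratio stay
# missing instead of falling through to "Mixed".
df_merged["gender_category"] = df_merged["female_ratio"].apply(
    lambda x: np.nan if pd.isna(x)
              else "Female-dominated" if x >= 0.55
              else "Male-dominated" if x <= 0.45
              else "Mixed"
)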


df_reg = df_merged.dropna(subset=[
    "SALARY",
    "female_ratio",
    "STATE_NAME",
    "Occupation"
])

X = df_reg[["female_ratio","STATE_NAME","Occupation"]]
X = pd.get_dummies(X, columns=["STATE_NAME","Occupation"], drop_first=True)
y = df_reg["SALARY"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=688
)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForest":     RandomForestRegressor(n_estimators=100, random_state=688)
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        "R2":   r2_score(y_test, y_pred)
    }

for name, mets in results.items():
    print(f"{name:17s} → RMSE: {mets['RMSE']:.2f},  R²: {mets['R2']:.3f}")

corr = df_reg[["SALARY","female_ratio"]].corr().iloc[0,1]
print(f"\nCorrelation(SALARY, female_ratio): {corr:.3f}")

rf = models["RandomForest"]
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances = importances.sort_values(ascending=False).head(10)
print("\nTop 10 feature importances from RandomForest:")
print(importances)
LinearRegression  → RMSE: 42525.67,  R²: 0.121
RandomForest      → RMSE: 42364.81,  R²: 0.127

Correlation(SALARY, female_ratio): -0.185

Top 10 feature importances from RandomForest:
female_ratio                                                0.474548
Occupation_Education, training, and library occupations     0.047786
STATE_NAME_California                                       0.042036
Occupation_Business and financial operations occupations    0.030760
STATE_NAME_New York                                         0.024201
STATE_NAME_Washington                                       0.019251
STATE_NAME_Texas                                            0.018477
Occupation_Computer and mathematical occupations            0.014710
STATE_NAME_Virginia                                         0.013087
STATE_NAME_Oregon                                           0.012776
dtype: float64
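
Because `female_ratio` is attached through a left merge on occupation labels, any label in `naics_to_occupation` without an exact match in the gender sheet produces a missing ratio, and `dropna` then removes that posting from the regression sample. The sketch below shows one way to count such rows using the `indicator` option of `DataFrame.merge`. The two small frames and the ratio value are hypothetical stand-ins for `df_jobs` and `df_gender`.

```python
import pandas as pd

# Hypothetical stand-ins for df_jobs and df_gender.
jobs = pd.DataFrame({
    "Occupation": [
        "Management occupations",
        "Public Administration",
        "Management occupations",
    ]
})
gender = pd.DataFrame({
    "occupation": ["Management occupations"],
    "female_ratio": [0.42],  # made-up value for illustration
})

# indicator=True adds a "_merge" column: "both" for matched rows,
# "left_only" for postings whose occupation label found no match.
checked = jobs.merge(
    gender, left_on="Occupation", right_on="occupation",
    how="left", indicator=True,
)
unmatched = (
    checked.loc[checked["_merge"] == "left_only", "Occupation"]
    .value_counts()
)
print(unmatched)
```

Running the same check on the real frames would show how many postings, and which occupation groups, are excluded before model fitting.
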
Code
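import os
import plotly.express as px

# `importances` already holds the top 10 features from the random forest
df_importances = pd.DataFrame({
    "Feature": importances.index,
    "Importance": importances.values
})

# Horizontal bar chart of the top 10 features
fig = px.bar(
    df_importances,
    x="Importance",
    y="Feature",
    orientation="h",
    title="Top 10 Feature Importances"
)

fig.update_layout(
    xaxis_title="Importance",
    yaxis_title="Feature",
    yaxis=dict(autorange="reversed"),  # largest bar at the top
    template="plotly_white",
    height=600,
    width=800
)

# Static export needs the kaleido package and an existing output folder
os.makedirs("_output", exist_ok=True)
fig.write_image("_output/regression_1_feature_importance.png")
fig.show()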

The Pearson correlation between salary and female_ratio is -0.185 (the value printed above), indicating a weak negative relationship: occupations with higher female shares tend to pay somewhat less on average.
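
For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on five invented rows and compares the result with the pandas `.corr()` call used in the analysis. The numbers are placeholders chosen only to produce a negative relationship.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from the job postings.
toy = pd.DataFrame({
    "SALARY": [120000, 95000, 80000, 70000, 60000],
    "female_ratio": [0.20, 0.35, 0.50, 0.65, 0.80],
})

x = toy["female_ratio"].to_numpy()
y = toy["SALARY"].to_numpy()

# r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
dx, dy = x - x.mean(), y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Same quantity via pandas, as in the analysis code
r_pandas = toy[["SALARY", "female_ratio"]].corr().iloc[0, 1]

print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

Squaring the coefficient gives the share of variance a single-variable linear fit would explain. For the printed value of -0.185 that is about 0.034, so female_ratio on its own linearly accounts for roughly 3% of salary variance.
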

3 🔍 Implications for Job Seekers

  1. Gender Composition & Pay Gap
    • Higher female representation correlates with lower average pay, reflecting occupational gender segregation and compensation gaps.
    • Female job seekers might consider targeting occupations or regions with more balanced—or male‑dominated—workforces to maximize compensation potential.
  2. Geographic Differences
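    • California, New York, and Washington are the most important state indicators in the random forest. Feature importance does not show direction, so the linear regression coefficients should confirm that these states pay above the reference state; if they do, and relocation is an option, applying there may yield higher offers.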
  3. Occupational Targets
    • Occupations such as education/training and financial operations rank highly in feature importance, suggesting they are particularly predictive of salary.
    • Computer and mathematical occupations also rank in the top ten. Because importances are unsigned, whether candidates with skills in these areas command higher salaries should be read from the signed linear regression coefficients.
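
Random forest importances measure how much a feature is used for splitting, not whether it raises or lowers predicted salary. One way to check direction is to inspect the signed coefficients of the fitted linear model. The sketch below illustrates the idea on synthetic data with a built-in premium for one state and a negative slope on female_ratio. The state names, effect sizes, and noise level are all assumptions made for the illustration, not estimates from this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(688)
n = 300

# Synthetic inputs (hypothetical): three states and a female ratio
states = rng.choice(["Alabama", "California", "Texas"], size=n)
female_ratio = rng.uniform(0.1, 0.9, size=n)

# Synthetic salaries: +20,000 for California, -30,000 per unit of ratio
salary = (
    80000
    + 20000 * (states == "California")
    - 30000 * female_ratio
    + rng.normal(0, 5000, size=n)
)

X = pd.get_dummies(
    pd.DataFrame({"female_ratio": female_ratio, "STATE_NAME": states}),
    columns=["STATE_NAME"], drop_first=True, dtype=float,
)

lin = LinearRegression().fit(X, salary)

# Signed coefficients, relative to the dropped reference state (Alabama)
coefs = pd.Series(lin.coef_, index=X.columns).sort_values()
print(coefs)
```

Applied to the real model, the equivalent step would be `pd.Series(models["LinearRegression"].coef_, index=X_train.columns)`, with each state or occupation coefficient interpreted relative to the category removed by `drop_first=True`.
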